Reverse-engineering a model's computation — reading what it does, not just what it says
Day 36 of 60
Week 7 left you with an uncomfortable conclusion: if a model can fake alignment — behave well while observed and defect when it isn't — then black-box testing can never fully clear it. You can run a thousand evals and still not know whether the model is safe or just knows it's being tested. That is the wall behavioral safety hits. Mechanistic interpretability is the field's most ambitious attempt to climb over it: instead of judging the model by its outputs, you reverse-engineer the computation inside and read what it's actually doing.
Interpretability treats a trained network not as a black box but as a program written in weights — one nobody wrote on purpose and nobody has read. The goal is to decompile it: identify the internal features and circuits the model uses, so that "is it deceptive?" becomes a question you can answer by inspection, not just by hoping the test caught it. It is the closest thing the field has to a lie detector for models.
The honest framing matters from the first sentence: this is a research program, not a finished tool. By the end of the week you'll be able to explain both what interpretability can already do and exactly where it still falls short, and that pairing is what makes you credible instead of breathless.
"Interpretability" is an overloaded word. Practitioners distinguish three kinds, and mechanistic interpretability is the most specific and demanding of them.
Black-box (behavioral) interpretability: probe inputs and watch outputs. This covers feature attributions, saliency maps, and "the model said X because the prompt contained Y." Useful, but it never opens the box: it correlates inputs with outputs and can be fooled by a model that behaves differently when watched.
Representational interpretability: ask what information is present in the model's internal activations. Can a concept be read off a hidden layer? This is the level of linear probes (Day 38). It tells you a concept is encoded; it doesn't tell you the model uses it.
Mechanistic interpretability: the ambitious one. Identify the actual features (directions in activation space that mean something) and circuits (subgraphs of components that compute something) the model runs, and show how they combine to produce a behavior. This is reverse-engineering, not correlation, and it's what could, in principle, verify internals.
Black-box interp asks "what does it do?" Representational interp asks "what does it know?" Mechanistic interp asks "how does it actually compute this, step by step?" Only the third gives you something that could survive a model trying to deceive you.
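To make the representational level concrete, here is a minimal sketch of a probe, under loud assumptions: the "activations" are synthetic Gaussian vectors rather than anything taken from a real model, the concept is planted along one hypothetical direction, and the probe is a simple difference-of-means classifier chosen for brevity (it is not necessarily the method Day 38 uses). All function names are illustrative.

```python
import random

def make_activations(n, dim, direction, strength, rng):
    """Synthetic stand-in for hidden activations: Gaussian noise,
    plus `strength * direction` whenever the concept label is 1."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) + label * strength * d for d in direction]
        data.append((vec, label))
    return data

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_probe(train):
    """Difference-of-means probe: w points from the class-0 mean to the
    class-1 mean, and the bias puts the boundary at their midpoint."""
    pos = mean_vector([v for v, y in train if y == 1])
    neg = mean_vector([v for v, y in train if y == 0])
    w = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    bias = -dot(w, midpoint)
    return w, bias

def probe_accuracy(w, bias, data):
    """Fraction of examples where the sign of w.v + bias matches the label."""
    correct = sum((dot(w, v) + bias > 0) == (y == 1) for v, y in data)
    return correct / len(data)

rng = random.Random(0)            # fixed seed so the run is repeatable
dim = 8                           # assumed toy hidden size
direction = [1.0] + [0.0] * (dim - 1)  # assumed planted concept direction
train = make_activations(200, dim, direction, 4.0, rng)
test = make_activations(200, dim, direction, 4.0, rng)

w, bias = fit_probe(train)
accuracy = probe_accuracy(w, bias, test)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

A high held-out accuracy here only shows that the concept can be read off the activations. Nothing in the probe says whether a downstream computation ever uses that direction, which is exactly the gap between the representational and mechanistic levels.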
The founding move of modern interpretability was to stop treating a transformer as an inscrutable matrix and instead trace specific circuits: small, identifiable pieces of computation. The landmark result is the induction head, a circuit built from two attention heads in different layers that together implement a simple but powerful rule: "I saw the pattern [A][B] earlier; now I'm seeing [A] again, so predict [B]." It's how models do basic in-context pattern completion, and it was found, named, and mechanistically explained. That is proof that the inside of a network is at least sometimes legible.
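The [A][B] ... [A] → [B] rule can be written out as a few lines of ordinary code. This is only a sketch of the behavior the circuit produces, not of the mechanism: a real induction head gets the same effect through attention patterns over learned weights, and the function name below is hypothetical.

```python
def induction_predict(tokens):
    """Toy version of the induction rule: look back for the most recent
    earlier occurrence of the current token and predict whatever followed
    it. Returns None if the current token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["A", "B", "C", "A"]))  # prints B
print(induction_predict(["x", "y"]))            # prints None
```

The point of the interpretability result is that this lookup, which is trivial to state as a program, was located inside trained weights and explained component by component.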
If a behavior is implemented by an identifiable circuit, you can in principle ask whether a deception behavior has one too — and watch for it firing. That is the whole bet: turn "we tested and it seemed fine" into "we looked inside and here's the mechanism." Today you only need the intuition that circuits exist and can be read; the rest of the week builds on it.
An enthusiast says "interpretability lets us understand models." An expert names the altitude: behavioral explanation correlates inputs with outputs and can be gamed; mechanistic interpretability reverse-engineers the actual circuit, which is the only kind of evidence that could survive a model trying to look safe. Knowing which kind of "understanding" you're claiming is the whole credibility gap.
Say this in an interview: "Mechanistic interpretability isn't feature attribution — it's reverse-engineering the circuits a model actually runs, like induction heads. I care about it specifically because if deceptive alignment is real, behavioral evals can't clear a model on their own, and reading internals is the only path to verification rather than hope."